Introduction to Open Data Science - Course Project

About the ODS project

The course is an open data science course for PhD students at the University of Helsinki. I'm especially expecting to learn some more R and R Markdown, and also to use GitHub.

The link to my GitHub repository is https://github.com/akarimo/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Thu Nov 26 20:25:21 2020"

The text continues here.


Chapter 2: Regression and model validation

In this chapter I report what I did for the regression analysis assignment. I used JYTOPKYS3 survey data collected by Kimmo Vehkalahti between 3.12.2014 and 10.1.2015. Below you can see the structure and dimensions of the data. It has information on the gender and age of the respondents, as well as on their learning skills, attitudes, and exam points in a statistical course exam. All other variables besides "gender", which is a dichotomous variable, are continuous variables. Variables "attitude", "deep", "stra", and "surf" are averages of 10 variables measured on a likert-scale. The analyzed version of the data includes 166 respondents, because respondents who didn't attend an exam were removed.

#read the data

learning2014 <- read.csv("~/Desktop/IODS/R project for ODS/IODS-project/data/learning2014.csv", row.names = 1)

#checking the structure and dimensions of the dataset

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166   7

In the plot below you can see scatterplot matrices, distributions, and correlations between the variables by gender. The information on females is coloured pink, and the information on males is coloured light blue. There are a lot more females in the dataset, and most respondents are under 40 years old. The box plots and distributions of the variables by gender seem otherwise quite similar, but it seems that females have more variance in their attitude compared to males. Also, the scores for attitude seem in general a bit lower for females compared to males. We can also see that there is a larger share of males with lower scores for the variable "surf" compared to females.

The correlations between variables are in general quite low, and not statistically significant. We can also see from the scatterplots, that the relationships between these variables seem mostly quite random. Variable "surf" is significantly negatively correlated with both "attitude" and "deep", but when we look at it by gender, we see that this correlation is only significant for males. Variable "surf" is also significantly negatively correlated with "stra" for all respondents, but we dont find it significant by gender. Variable "attitude" and variable "Points" are significantly positively correlated with each other for both males and females.

#loading required packages

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# creating a plot of the data
plot_learning2014 <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3),
             lower = list(combo = wrap("facethist", bins = 20)))

plot_learning2014

After studying the relationships of these variables graphically and through correlations, it seems that the "attitude" variable might be explaining the variation in exam points the best. It also seems possible, that "stra" and "surf" would be significantly associated with "Points" in a more advanced model. Below I present a summary of a multiple linear regression model explaining exam points with attitude, strategic learning ("stra"), and surface learning ("surf"). All of these variables are approximately normally distributed according to visual interpretation of their distribution. However, the scatterplots representing the relationship between "Points" and "stra", and "Points" and "surf" don't indicate a linear relationship between these variables.

# creating a regression model with attitude, strategic learning, and 
# surface learning as explanatory variables

model1 <- lm(Points ~ attitude + stra + surf, data = learning2014)

# print out a summary of the model

summary(model1)
## 
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

In a model summary, significant parameter estimates are typically marked with stars. As you can see from the model summary above, attitude is in fact significantly positively associated with exam points, but strategic learning and surface learning are not. Columns "t value" and "Pr(>|t|)" in the "Coefficients" table include the t-value and p-value of the parameters. These values correspond to a statistical test of a null hypothesis, that the actual value of the parameter estimate would be zero. It is in general accepted, that p-values under 0.05 indicate a statistically significant parameter estimate. In this case the p-value of the "attitude" parameter is very low, but for "stra" and "surf" it's nowhere close 0.05.

Because "stra" and "surf" are not significantly associated with "Points", I remove them from the model. You can see the results of the new model below.

#remove stra and surf and run model again

model2 <- lm(Points ~ attitude, data = learning2014)

# print out a summary of the model

summary(model2)
## 
## Call:
## lm(formula = Points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

From the model summary above you can see, that in this simple linear model the parameter estimate for the variable "attitude" is slightly higher than in the previous model. Also, in the "Residuals" table we see, that the residuals for this model are smaller than for the previous model, indicating a better model fit. The model fit can be further analyzed with the value of the Multiple R squared at the bottom of the model summary. In this case it's 0.19, indicating that the model can explain 19 percent of the variance in our dependent variable. In the case of this simple linear regression, this means that differences in attitude explain about a fifth of the variance in exam points.

To analyze how well this model fits the assumptions of linear regression, we produce some model diagnostics below. The assumtions of a linear regression include, that the errors have constant variance implying that the size of the errors is not dependent on the explanatory variables, that the errors are normally distributed, and that the errors are not correlated. From the "Residuals vs Fitted" plot in the top left corner we can see, that the relationship between the residuals and the fitted values is quite random, which indicates that the size of the errors is not dependent on the explanatory variable. In the "Normal Q-Q" plot we see that the errors are reasonably normally distributed, and thus fit the normality assumption, and the results in the "Residuals vs Leverage" plot imply, that no single observation has unusually high impact on the model. These model diagnostics imply a reasonably good fit to the data, so we can trust our results.

# draw diagnostic plots
par(mfrow = c(2,2))
plot(model2, which = c(1,2,5))

date()
## [1] "Thu Nov 26 20:25:33 2020"

Here we go again...


Chapter 3: Logistic regression

In this chapter I report the logistic regression exercise. I'm using data from UCI Machine Learning Repository. The data has information on student grades, demographic, social and school related features for students in two Portuguese schools. Below is a list of attributes in the data.

#read the data

alc <- read.csv("~/Desktop/IODS/R project for ODS/IODS-project/data/alc.csv")

#print names of variables
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"

I'm studying the effect of sex, age, number of school absences, and final grade (G3) on high alcohol use. My hypotheses are:

Hypothesis 1: male students have a bigger risk of high alcohol use compared to female students Hypothesis 2: the older the student, the bigger the risk of high alcohol use Hypothesis 3: the more school absences, the bigger the risk of high alcohol use Hypothesis 4: the lower the grade, the bigger the risk of high alcohol use

library(ggplot2)

#barplot sex and high use
ggplot(data = alc, aes(x = sex, fill = high_use)) + geom_bar()

#barplot age and high use
ggplot(data = alc, aes(x = age, fill = high_use)) + geom_bar()

#boxplot absences and high use
ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot()

#cross-tabulation grades and high use
table(highuse = alc$high_use, grades = alc$G3)
##        grades
## highuse  0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
##   FALSE  1  0  1  6  4 15  4 19  3 38 16 54 19 31 15 30  5  7
##   TRUE   1  1  0  1  2  7  2  8  4 26 10 27  7  7  1  8  2  0

From the plots above we can see, that it does seem, that high use is more common among male students compared to female students, but the relationship with high use and age is not clear based on the bar plot. Based on the boxplot it seems, that high use is indeed to some extent related to higher amonts of absences, and based on the cross-tabulation it seems, that students with higher grades are less likely to belong in the group of high users.

Next I will fit the logistic regression model. From the model summary below we can see, that sex and absences are significantly associated with high alcohol use. Odds ratios in the table below show, that males are approximately 2.68 times more likely to be in the category of high use compared to females. Also, one point increase in absences increases the odds of being in the high use category by approximately 1.06. Age or grade (G3) are not statistically significantly associated with high alcohol use.

# fit the model
model1 <- glm(high_use ~ sex + age + absences + G3, data = alc, family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = high_use ~ sex + age + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2729  -0.8326  -0.6292   1.0442   2.1190  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.38339    1.84796  -1.831 0.067120 .  
## sexM         0.98580    0.24184   4.076 4.58e-05 ***
## age          0.13985    0.10328   1.354 0.175690    
## absences     0.08964    0.02309   3.882 0.000104 ***
## G3          -0.06631    0.03624  -1.830 0.067277 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 423.66  on 377  degrees of freedom
## AIC: 433.66
## 
## Number of Fisher Scoring iterations: 4
library(tidyr); library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# compute odds ratios (OR)
OR <- coef(model1) %>% exp

# compute confidence intervals (CI)
CI <- confint(model1) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR        2.5 %   97.5 %
## (Intercept) 0.03393229 0.0008683319 1.237502
## sexM        2.67996134 1.6781593816 4.338864
## age         1.15010522 0.9401979286 1.410759
## absences    1.09377649 1.0474354900 1.146836
## G3          0.93584524 0.8710385897 1.004426

Next, we look at the predictive power of the model with only sex and absences as predictors.

library(dplyr)

# fit a new model with only significant predictors
model_pred <- glm(high_use ~ sex + absences, data = alc, family = "binomial")


# predict() the probability of high_use
probabilities <- predict(model_pred, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   10
##    TRUE     88   26
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()


Chapter 4: Clustering and classification

First, I load the data and look at the structure and dimensions. The dataset includes data on per capita crime rate by town (crim), and different variables regardin the residents, homes, and businesses in these towns. The data has 506 observations and 14 variables.

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the structure and dimensions of the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

In the plot below we can see how these variables correlate with each other. Negative correlations are marked red, and positive correlations with blue. We can see that there are quite strong positive correlations for example between proportion of non-retail business acres per town (indus) and nitrogen oxides concentration (nox), as well as between index of accessibility to radial highways (rad) and full-value property-tax rate per $10,000 (tax). There are strong negative correlations between weighted mean of distances to five Boston employment centres (dis) and indus, nox, and proportion of owner-occupied units built prior to 1940 (age).

Summaries of these variables show that the scales of these variables vary a lot. Some variables range between 0 and 1 (e.g. chas), while others range between 187 and 711 (tax).

library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.3     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ✓ purrr   0.3.4
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) 

#save rounded matrix
cor_matrix <- cor_matrix %>% round(digits = 2)

# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

#summaries of the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

To make the variables more comparable I standardize them. Below we can see that the scaled variables are now centered around zero.

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

Next, I create a categorical variable of the crime rate.

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

And divide the dataset to train and test sets, so that 80% of the data belongs to the train set

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Next, I fit a linear discriminant analysis on the training dataset. In the plot below we see the results of the analysis, where the observatios are coloured according to the categorical crime rate.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow.heads = 0.1, color = 'purple', tex = 0.75, choices = c(1,2)) {
  heads = coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale*heads[,choices[1]],
         y1 = myscale*heads[,choices[2]],
         col = color, length = arrow.heads)
  text(myscale*heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}
classes = as.numeric(train$crime)

#plot the result
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 2)

I save the correct crime rate classes from the data.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

And predict the classes with the test data. In the table below we see that the "high" category is predicted perfectly, but more errors are made with the other categories. For example from the "med_high" category, only 14 observations are predicted to be in the correct category, while 14 are predicted to be in the "med_low" category, and 3 in the "high" category.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18       8        2    0
##   med_low    6      17        8    0
##   med_high   1       4       11    4
##   high       0       0        0   23

Next, I reload the dataset, scale the variables again, and look at euclidean distance between observations.

# reload the data
data(Boston)
# center and standardize variables
boston_scaled <- scale(Boston)
boston_scaled <- data.frame(boston_scaled)

# euclidean distance matrix
eu_dist <- dist(boston_scaled)

Before running a k-means algorithm, I investigate what is the optimal number of clusters. In the plot below we see, that the optimal number of clusters is 2, because the total WCSS drops radically.

set.seed(123) #set seed to ensure results are replicable

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results

library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')

Next I run the k-means algorithm with 2 clusters and plot the clusters. In the plot we see that the clusters seem to differ especially regarding the variables indus, nox, age, dis, rad, tax, ptratio, and lstat.

# k-means clustering
km <-kmeans(boston_scaled, centers = 2)

# plot the scaled Boston dataset with clusters
library(GGally)
ggpairs(boston_scaled, mapping = aes(col = as.factor(km$cluster), alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero


Chapter 5: Dimensionality reduction techniques

In this chapter I use the "human" data, more info can be found in https://raw.githubusercontent.com/TuomoNieminen/Helsinki-Open-Data-Science/master/datasets/human_meta.txt

The data includes several indicators from most countries in the world; the ratio of females and males with at least secondary education, the ratio of females and males in the labour force, expeted years of schooling, life expectancy at birth, Gross National Income per capita, maternal mortality ratio, adolescent birth rate, and percentage of female representatives in parliament.

human <- read.csv("~/Desktop/IODS/R project for ODS/IODS-project/data/human.csv", row.names = 1)

summary(human)
##    edu_ratio        lab_ratio         exp_edu         exp_life    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            mat_mor         ado_birth         rep_parl    
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

In the plots below we see that some of the variables are approximately normally distributed (e.g. expected years of schooling), while the distribution of other variables is very much skewed. The ratio of females and males with at least secondary education, expeted years of schooling, life expectancy at birth, and Gross National Income per capita are all positively correlated with each other, and negatively correlated with maternal mortality ratio and adolescent birth rate. The ratio of females and males in the labour force and percentage of female representatives in parliament aren't correlated with anything.

library(GGally)
library(dplyr)
library(corrplot)
ggpairs(human)

cor(human)%>%corrplot()

Principal component analysis

Based on the results below, principal component analysis doesn't seem to work with unstandardized data. The fist principal component seems to capture almost all of the variance, and it's heavily correlated with GNI.

pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Next I standardize the data and re-do the analysis to see if it helps. From the results below we see, that now the first principal component captures about 54 percent of the variability, and the second principal component about 16 percent. The ratio of females and males with at least secondary education, expeted years of schooling, life expectancy at birth, Gross National Income per capita, maternal mortality ratio and adolescent birth rate seem to contribute to the first principal component, while the ratio of females and males in the labour force and percentage of female representatives in parliament contribute to the second principal component.

Based on these results and how the countries are situated in the biplot, the first principal component seems to capture mostly the wealth of the country. The second component captures some aspects of gender equality.

human_std <- scale(human)

pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

Multiple Correspondence Analysis

In this exercise I use the "tea" dataset. There are 36 variables in the dataset, so I will only keep variables "Tea", "How", "how", "sugar", "where", and "lunch".

library(FactoMineR)
library(tidyr)

data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

Results of the multiple correspondence analysis (MCA) below show, that the first dimension holds about 15 percent of the variance, and the second about 14 percent of the variance. There are 11 dimensions in total, and all of them retain some of the variability. The biplot shows variables drawn on the first two dimensions. Categories of the same variable are coloured with the same colour. The distance between the variable categories gives a measure of their similarity. For example in the bottom right corner we see that people who use unpacked tea buy their tea from a tea shop rather than chain store.

mca <- MCA(tea_time, graph = FALSE)

summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")


(more chapters to be added similarly as we proceed with the course!)